i) Account : Account number for transactional purpose unique code (for cashiering and accounting personnel to reference).
ii) Guest Name : Name of a guest staying at hotel.
iii) Arrive : Guest arrival (entry) date.
iv) Depart : Guest exit date.
v) Nights : Total number of nights (days) a guest is staying.
vi) Status : Room current status
O (Occupied), R (Reserved), I (Inspected), N (No-Show), C (Cancelled) 1) No-Show: A reservation becomes a No-Show when the customer who has a guaranteed reservation does not cancel it before the hotel’s cancellation deadline, and never arrives to claim the reservation.
2) Occupied: Customer booked the room and actually he arrived as per check-in date.
3) Reserved: If the guest has not claimed or canceled the booked room by the specified time with hotel policies, you must hold the reserved rooms available until check-out time the following day.
4) Canclled: Customer has been cancelled the booked room successfully within the time period of cancellation policies.
5) Inspected: The main Purpose of a room inspection is to catch any problems that may have been overlooked during the cleaning before it is found by the guest and becomes a dissatisfaction and complaint. After guest check-out, manager is responsible for the room inspection before it allocated to new customer.
vii) Rate : The price of hotel charges for overnight accommodations (Room rates).
viii) Rate code : Room Rates and rate codes may also vary according to the available room features such as room size, location, view, furnishing, decors, competitors pricing. Similar to below references,
1) Early-bird Rate - This type of rates are only open X days before arrival. Eg: Open only when 7 Days before arrival.
2) Package Rate: Rates that includes a guest room in combination with other available events or activities. ( Eg: Best of London package which includes room rental, all meals, site seeing, airport transfers etc.)
3) Best Available Rates (BAR): These rate codes are the lowest discounted rate available for a day which can be offered to the guest by the Reservation or Front desk staff. BAR can be of different types
BAR Level Open / Close when Occupancy Between DATE DAY RATE
BAR - 01 0% TO 25 % 19/10/2016 WED 200.00
BAR - 02 26 % TO 35 % 20/10/2016 THU 200.00
BAR - 03 36% TO 50% 21/10/2016 FRI 250.00
BAR - 04 51% TO 75% 22/10/2016 SAT 250.00
BAR - 05 76% TO 100% 23/10/2016 SUN 250.00
24/10/2016 MON 200.00
25/10/2016 TUE 200.00
4) Family Rate: A rate reserved for families with children. Usually, these rates include Extra Bed charges and may also include some free add-on activities for children
ix) Room : Respetive room numbers assigned to rooms as per the hotel configurations.
x) Type :
1) NQQ - Double Queen Rooms (Non Smoking 2 Queen bed)
2) NK - Single King Rooms (Non Smoking Single King bed)
3) SNQQ - Double Queen Suites (Suites Non Smoking 2 Queen bed)
4) SNQQ1 - Double Queen Suites (Suites Non Smoking 2 Queen bed)
5) SNK - Single King Suites (Suites Non Smoking Single King bed)
6) NHQQ - Double Queen Room Handicap Equipped (Non Smoking Handicap 2 Queen bed)
7) NHK - Single King Room Handicap Equipped (Non Smoking Handicap Single King bed)
8) SNHK - Single King Suites Handicap Equipped (Suites Non Smoking Handicap Single King bed)
Note: i) 1 Queen bed --> Bed = 1 | Max adults = 2
ii) 2 Queen bed --> Bed = 2 | Max adults = 4
ref: https://www.comfortinnofblueridge.com/accommodations/rooms/
xi) Source : The room booking mode which is used by guest
1) CRS Central Reservation System (online mode) :15714 2) DIRECT Direct (direct mode) :9330 3) WI Walk In (direct mode) :2423 xii) CRS Conf No : N.A
xiii) GTD : N.A
Guaranteed --> Prior to a function, the figure given by a meeting planner to the property for the number of persons to be served.
VI 7565
MC 4808
CC 4704
DB 3661
CA 2985
GM 2786
AX 673
DS 255
DP 13
CK 9
4P 6
xiv) Reserve Date : Room reservation date (On this day, guest reserve/ book the rooms by entering details as arrival, depart, adults, bed, etc)
xv) User : Receiption desk person id which is responsible for handling the guest activities
xvi) Shared Account : Is any shared / same account used for booking.
xvii) Track Code : Purpose for the room booking
Corporate (Meeting, signing, dealing) Leisure (Rest) Walk-in (Other purposes on quick decision) xviii) Package : Complete package cost for the guest including per person charge, inclusive of guest room, food and beverage, and proportional percentage of all other event charges.
(Packages seasonal - Winter, Spring, Summer) xix) Cancellation Date : The date on which cancelation request initiated for booked/ reserved room.
xx) CXL User ID : Person Id (receiption desk person/ individual) who is responsible to cancle the booking/ reservation room.
i) Date : Date of the occupancy
ii) Day : Day associated with date
iii) Rooms : Total number of the rooms in hotel
iv) OOO : OOO (Rooms that are in Out Of Order) is typically used when a room is being renovated, undergoing repairs and cannot be used.
v) StayOver : The guest is not expected to check out today and will remain at least one more night or may be week.
vi) Arrivals : Displays all guests arriving on a certain day or within a certain set time period.
vii) DueOut : The room is expected to become vacant after the following day’s checkout time.
viii) Available : Total numbers of available rooms (-ve sign indicates overdue)
ix) Group Block : Booking the rooms in a group or package (Ex. For event, people book multiple rooms)
x) Group Picked Up : Booking the rooms in a group.
xi) TransNGTD : NA.
*GTD: Guaranteed. Prior to a function, the figure given by a meeting planner to the property for the number of persons to be served.
xii) TransGTD : NA.
*GTD: Guaranteed. Prior to a function, the figure given by a meeting planner to the property for the number of persons to be served.
xiii) Occupied : Total number of rooms are in used/ occupied (Guest is currently registered to the room). [Rooms - Available - GroupBlock + GroupPickedUp]
xiv) OccPercentage : Room coocupany rate on perticular day (occupied rooms / total available rooms). The occupancy rate will tell you exactly how full your hotel is at any point in time.
xv) RoomRev : Total revenue generated from occupied rooms on perticulaar day
xvi) RevPAR : Revenue Per Available Rooms (total revenue / total available rooms). This metric provides a glimpse into the number of rooms that are being sold at a hotel and how much revenue is being generated from those bookings. You should use RevPAR to understand the best way to maximise the revenue generated per room. If the RevPAR of your property is increasing, it must mean your average room rate or occupancy rate is increasing – or both!
xvii) ADR : Average daily rate of guest on perticular day (total revenue / rooms occupied). it is calculated using the amount of revenue earned and the number of rooms sold to give you an average rate. [Ref: https://www.siteminder.com/r/calculate-revpar/]
1) ADR will simply tell you how much revenue each sold room is selling for on average, while RevPAR will tell you how much revenue you’re bringing in for all your rooms.
2) Since ADR simply indicates the price of your rooms while RevPAR will tell you how much money you yield from each room, sold or not.
viii) Ppl : NA.
**Formulaes Details - Traditionally, the effectiveness of revenue management strategy is measured using the following KPIs
a) Occupancy rate — the number of occupied rental units at a given time, compared to the total number of available rental units at that time.
Occupancy rate = Rooms sold / Room available
b) Average daily rate (ADR) — this rate is applied to a room’s average rental income during a certain period. It’s compared to hotel’s historical ADR or competitors’ metrics.
ADR = Rooms revenue earned / Number of rooms sold
c) Revenue per available room (RevPAR) — a KPI that assess financial and business performance of a hotel. RevPAR measures ability of a property to fill all the rooms and define the best price for them. There are two ways to calculate it:
RevPAR = Rooms revenue / Rooms available
RevPAR = Average daily rate * Occupancy rate
d) Total revenue per available room (TRevPAR) — a metrics that accesses total revenue, generated by property and based on room cost and money spent on it. This KPI captures a snapshot of overall business performance. TRevPAR is one of the main benchmarking tools for big hotels and resorts.
The higher the TRevPAR, — the better the revenue.
TRevPAR = Total revenue / Total number of available rooms
e) Net revenue per available room (NRevPAR) — a KPI that allows hotel revenue managers to calculate the distribution cost to see how the room revenue is generated. NRevPAR includes spending on marketing and distribution.
NRevPAR = (Room revenue — distribution costs) / Number of available rooms
f) Gross operating profit per available room (GOPPAR) — measures the profit of a hotel and value of all assets at any given time. GOPPAR measures profit to capacity, including all a hotel’s spending and taxes.
GOPPAR = Gross Operating Profit / Number of available rooms
Ref: https://www.siteminder.com/r/calculate-revpar/ | https://www.upstay.tech/hotel-revenue-management-formulas-kpis-calculations-use-cases/
'''
Library and packages
'''
Importing the data '.txt' from the source and save in '.csv' format
'''
Reservation data and Occupancy data
'''
As we have the raw data and need to analyze, prepare for business case studies. This stage involves data preprocessing, data cleaning and data understanding.
i) Column rename
ii) Conversion of dataypes
iii) Replacenemt of unnecessary data values and EDA
iv) Missing value analysis
v) Date param extractions
vi) Data transformations and experiments
vii) Data insights
'''
Size of reservation data
'''
Dimenssion of Reservation data : (40613, 7)
Observation:
'''
Number of unique data categories
'''
RateCode 103 Type 8 Source 3 TrackCode 4 dtype: int64
Observations:
In given property data,
'''
Type of rooms
'''
Observations:
'''
Type of rate codes
'''
Observations:
'''
Type of track code
'''
Observations:
'''
Total nights or stays analysis from data
'''
| Night (stays) | 1 | 2 | 5 | 3 | 4 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Total counts | 25795.00 | 6923.00 | 2549.00 | 2339.00 | 2043.00 | 309.00 | 187.00 | 82.0 |
| % counts | 63.51 | 17.05 | 6.28 | 5.76 | 5.03 | 0.76 | 0.46 | 0.2 |
Observations:
i) Column rename
ii) Conversion of dataypes
iii) Replacenemt of unnecessary data values
iv) Missing value analysis
v) Date param extractions
vi) Data transformations and experiments
vii) Data insights and EDA
'''
Occupancy data size
'''
Dimenssion of Occupancy data : (2922, 18)
Observation:
Observation:
We have 4 dataypes, and for machine learning, we need all inputs in numeric format strickly
Column Name Datatype format
-------------------------------
Date datetime
Year int
Month int
Days int
Nights int
Type object
TrackCode object
RateCode object
Rate float
Occupied int
ADR float
'''
Below table is the yield data representation
'''
| OccupancyLevel | 0Day | 1Day | 2Day | 7Day | 15Day | 30Day | 60Day | 90Day | MaxDay | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0-10 | -0.20 | -0.17 | -0.15 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 |
| 1 | 11-20 | -0.17 | -0.15 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 |
| 2 | 21-30 | -0.15 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 |
| 3 | 31-40 | -0.12 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 |
| 4 | 41-50 | -0.10 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 |
| 5 | 51-60 | -0.07 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 |
| 6 | 61-70 | -0.05 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 |
| 7 | 71-80 | -0.02 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 | 0.20 |
| 8 | 81-90 | 0.02 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 | 0.20 | 0.20 |
| 9 | 91-100 | 0.05 | 0.07 | 0.10 | 0.12 | 0.15 | 0.17 | 0.20 | 0.20 | 0.20 |
Here, we are adding up the holidays in US region from python the package.
Package name : holidays
Desciption : US federal holiday calendar
ref: https://pypi.org/project/holidays/
Note: Initially we checking out (1) how much model learn/ capture this holiday effect? (2) And there is price change with this effect or not? (3) Holiday data is very very low as compared to actual data length, but with this initial data, we are trying to see is chance is happend or not?
$\color{Blue}{\text{,,}}$
'''
Holiday dates and descriptions
'''
2022-01-01 - New Year's Day 2022-01-17 - Martin Luther King Jr. Day 2022-02-21 - Washington's Birthday 2022-05-30 - Memorial Day 2022-06-19 - Juneteenth National Independence Day 2022-06-20 - Juneteenth National Independence Day (Observed) 2022-07-04 - Independence Day 2022-09-05 - Labor Day 2022-10-10 - Columbus Day 2022-11-11 - Veterans Day 2022-11-24 - Thanksgiving 2022-12-25 - Christmas Day 2022-12-26 - Christmas Day (Observed)
Obsercation:
Here, we are extracting Day of week, weekend, month name, day name, week name, etc... from Date column.
i) Extraction of data parameters
ii) Pre-processing data
iii) Validate data quality
iV) Analyzing the data and EDA
Note: Direct Date (datatype: datetimens) is not handle by ML model. Machine Learning algortihm expects data is in numeric form. Hence we are extracting requied fields from Date directly.
Currently taking event data from web portal and save event data from date range 2017 to 2027 in excel file.
ref: https://www.timeanddate.com/holidays/us/super-bowl
Ex. (1) With reference link, in year 2022 below are the major sporting events.
13 Feb Sunday Super Bowl Sporting event
18 Apr Monday Boston Marathon Sporting event
6 May Friday Kentucky Oaks Sporting event
7 May Saturday Kentucky Derby Sporting event
21 May Saturday Preakness Stakes Sporting event
11 Jun Saturday Belmont Stakes Sporting event
6 Nov Sunday New York City Marathon Sporting event
Note: Initially we checking out (1) how much model learn/ capture this effect? (2) And there is price change with this effect or not? (3) Event data is very very low as compared to actual data length, but with this initial data, we are trying to see is chance is happend or not?
As Machien Learning algorithms expects all inputs must be in numeric format, hence in this step, we are converting non-numeric (Obejct) form to numeric (int) using AI techniques.
'''
Encode the inputs (Objects) into machine learning algorithm format (Numerical)
'''
In this section, we are applying Machine Learning algorithms to our data. The business case is related to prediction, hence applying ML Regression Algorithms and see the results on given data.
Input features : Year, Month, Days, Nights, Type, RateCode, TrackCode, Occupied
Output feature : Rate
Models : Linear model, Decision model, Boosting model
ML algorithms : LinearRegression, SVR, RandomForestRegression, GradientBoostingRegressor, XGBRegressor, KNeighborsRegressor
Matrics : MAE, RMSE, R2
Strategy - Considering 100% scale, about 70% data is use for model training purpose and remaining 30% data is use for testing purpose. So model is build on 70% of data and we will test the model (how model is being perform/ model behaviour) on 30% data which is unseen by model. This section is base model
'''
Load the Machine Learning algorithms and its functionalities
'''
Models :
Cross Validation :
Accuracy measures :
Error measure :
Keypoints - For best model performance, below are the points to be consider for model selection as
ref: https://scikit-learn.org/stable/model_selection.html#model-selection
'''
Machine Learning Model 1
'''
R2 Score (train) : 0.34630383021186495 R2 Score (test) : 0.33270983935782483 CV Scores mean : 0.34086520824731464 MAE rate : 16.665660922178073 RMSE rate : 22.160233192867548
Wall time: 331 ms
| Actual | Prediction | Diff | |
|---|---|---|---|
| 7782 | 85.00 | 77.371392 | 7.63 |
| 21242 | 131.48 | 116.950827 | 14.53 |
| 21462 | 88.14 | 83.767027 | 4.37 |
| 21089 | 65.00 | 97.052979 | -32.05 |
| 5323 | 79.52 | 85.274228 | -5.75 |
| ... | ... | ... | ... |
| 27882 | 65.00 | 68.494975 | -3.49 |
| 36286 | 173.63 | 109.512169 | 64.12 |
| 1636 | 110.00 | 93.677805 | 16.32 |
| 16428 | 71.25 | 97.396183 | -26.15 |
| 28192 | 110.00 | 91.062053 | 18.94 |
11928 rows × 3 columns
'''
Machine Learning Model 2
'''
R2 Score (train) : 0.9689863876823992 R2 Score (test) : 0.7665664825184898 CV Scores mean : 0.7784322941704107 MAE rate : 6.484943245463552 RMSE rate : 13.106857349102377
Wall time: 1.14 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 7782 | 85.00 | 85.00 | 0.00 |
| 21242 | 131.48 | 159.35 | -27.87 |
| 21462 | 88.14 | 83.16 | 4.98 |
| 21089 | 65.00 | 65.00 | 0.00 |
| 5323 | 79.52 | 79.52 | 0.00 |
| ... | ... | ... | ... |
| 27882 | 65.00 | 65.00 | 0.00 |
| 36286 | 173.63 | 159.08 | 14.55 |
| 1636 | 110.00 | 110.00 | 0.00 |
| 16428 | 71.25 | 74.25 | -3.00 |
| 28192 | 110.00 | 90.00 | 20.00 |
11928 rows × 3 columns
'''
Machine Learning Model 3
'''
R2 Score (train) : 0.9555923663801043 R2 Score (test) : 0.8376634486308019 CV Scores mean : 0.8488689425501071 MAE rate : 5.769247162198223 RMSE rate : 10.930124868497426
Wall time: 1min 30s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 7782 | 85.00 | 86.211429 | -1.21 |
| 21242 | 131.48 | 134.489800 | -3.01 |
| 21462 | 88.14 | 80.549300 | 7.59 |
| 21089 | 65.00 | 65.000000 | 0.00 |
| 5323 | 79.52 | 80.319000 | -0.80 |
| ... | ... | ... | ... |
| 27882 | 65.00 | 65.000000 | 0.00 |
| 36286 | 173.63 | 155.635857 | 17.99 |
| 1636 | 110.00 | 110.000000 | 0.00 |
| 16428 | 71.25 | 70.888233 | 0.36 |
| 28192 | 110.00 | 100.425579 | 9.57 |
11928 rows × 3 columns
'''
Machine Learning Model 4
'''
R2 Score (train) : 0.8902537061028056 R2 Score (test) : 0.8380872131169698 CV Scores mean : 0.8469743090551317 MAE rate : 6.881772223433578 RMSE rate : 10.915849508508083
Wall time: 20.8 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 7782 | 85.00 | 82.901123 | 2.10 |
| 21242 | 131.48 | 135.089218 | -3.61 |
| 21462 | 88.14 | 84.351509 | 3.79 |
| 21089 | 65.00 | 67.059540 | -2.06 |
| 5323 | 79.52 | 82.265274 | -2.75 |
| ... | ... | ... | ... |
| 27882 | 65.00 | 64.755722 | 0.24 |
| 36286 | 173.63 | 145.572418 | 28.06 |
| 1636 | 110.00 | 109.498741 | 0.50 |
| 16428 | 71.25 | 70.389725 | 0.86 |
| 28192 | 110.00 | 110.772255 | -0.77 |
11928 rows × 3 columns
'''
Machine Learning Model 5
'''
R2 Score (train) : 0.8389724978820228 R2 Score (test) : 0.7427346785511768 CV Scores mean : 0.7560069694961951 MAE rate : 7.89223373574782 RMSE rate : 13.759656428712793
Wall time: 6.67 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 7782 | 85.00 | 67.000 | 18.00 |
| 21242 | 131.48 | 124.796 | 6.68 |
| 21462 | 88.14 | 82.488 | 5.65 |
| 21089 | 65.00 | 66.000 | -1.00 |
| 5323 | 79.52 | 80.968 | -1.45 |
| ... | ... | ... | ... |
| 27882 | 65.00 | 65.000 | 0.00 |
| 36286 | 173.63 | 156.170 | 17.46 |
| 1636 | 110.00 | 110.000 | 0.00 |
| 16428 | 71.25 | 70.296 | 0.95 |
| 28192 | 110.00 | 100.200 | 9.80 |
11928 rows × 3 columns
'''
Machine Learning Model 6
'''
R2 Score (train) : 0.7625874320462578 R2 Score (test) : 0.7496412567990661 CV Scores mean : 0.7571152486849361 MAE rate : 9.701804042791437 RMSE rate : 13.57370314445734
Wall time: 28.5 s
| Actual | Prediction | Diff | |
|---|---|---|---|
| 7782 | 85.00 | 73.478236 | 11.52 |
| 21242 | 131.48 | 137.180354 | -5.70 |
| 21462 | 88.14 | 77.915852 | 10.22 |
| 21089 | 65.00 | 66.469882 | -1.47 |
| 5323 | 79.52 | 88.469300 | -8.95 |
| ... | ... | ... | ... |
| 27882 | 65.00 | 62.382809 | 2.62 |
| 36286 | 173.63 | 129.512482 | 44.12 |
| 1636 | 110.00 | 110.533133 | -0.53 |
| 16428 | 71.25 | 74.875591 | -3.63 |
| 28192 | 110.00 | 102.208249 | 7.79 |
11928 rows × 3 columns
We are comparing the model performaces, so that best model we can choose for tunning and predictions.
Table columns:
ref: (Cross validation) https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
'''
Machine Learning Model performanance analysis
'''
| ML Algorithms | Training Score | CV Mean Score | Testing score | MAE Rate | MSE Rate | RMSE Rate | |
|---|---|---|---|---|---|---|---|
| 0 | XGBRegressor(base_score=0.5, booster='gbtree',... | 0.890254 | 0.846974 | 0.838087 | 6.881772 | 119.155770 | 10.915850 |
| 1 | (DecisionTreeRegressor(max_features='auto', ra... | 0.955592 | 0.848869 | 0.837663 | 5.769247 | 119.467630 | 10.930125 |
| 2 | DecisionTreeRegressor() | 0.968986 | 0.778432 | 0.766566 | 6.484943 | 171.789710 | 13.106857 |
| 3 | ([DecisionTreeRegressor(criterion='friedman_ms... | 0.762587 | 0.757115 | 0.749641 | 9.701804 | 184.245417 | 13.573703 |
| 4 | KNeighborsRegressor() | 0.838972 | 0.756007 | 0.742735 | 7.892234 | 189.328145 | 13.759656 |
| 5 | LinearRegression() | 0.346304 | 0.340865 | 0.332710 | 16.665661 | 491.075935 | 22.160233 |
Observations:
Note: As top 2 models are XGBRegressor + RandomForestRgressor. Previously we got 80% accuracy, here we improve accuracy about 3% to 4%. For finding best situable parameter, we need to perform hyperparameter tunning and using this we can enhance our model with best results.
'''
Selection of model
'''
Experiment on instance / data balancing and then approaching towards the modeling
ref: https://imbalanced-learn.org/stable/references/over_sampling.html
'''
Class distrubution in data (before applying ML technique)
'''
Class=3, n=16211 (40.773%) Class=6, n=3765 (9.470%) Class=4, n=403 (1.014%) Class=2, n=11442 (28.778%) Class=7, n=2942 (7.400%) Class=0, n=841 (2.115%) Class=5, n=2791 (7.020%) Class=1, n=1364 (3.431%)
'''
Class distrubution in data (after applying ML technique)
'''
Class=3, n=16211 (12.500%) Class=2, n=16211 (12.500%) Class=5, n=16211 (12.500%) Class=6, n=16211 (12.500%) Class=0, n=16211 (12.500%) Class=4, n=16211 (12.500%) Class=7, n=16211 (12.500%) Class=1, n=16211 (12.500%)
'''
Machine Learning Model 1
'''
Training score : 0.8684556564188495 Testing score : 0.8443767419186052 CV score : 0.8422003719574469 MAE score : 6.870559424361396 RMSE score : 10.484148300350673 Wall time: 42.6 s
'''
Machine Learning Model 2
'''
Training score : 0.9791126387064626 Testing score : 0.8996065766649173 CV score : 0.9005385275721396 MAE score : 4.176186507389356 RMSE score : 8.420706599781175 Wall time: 4min 5s
'''
Machine Learning Model 3
'''
Training score : 0.9040731873366943 Testing score : 0.8510408557673945 CV score : 0.8540658026092762 MAE score : 5.123515466098716 RMSE score : 10.257215693268932 Wall time: 22.1 s
'''
Machine Learning Model 4
'''
Training score : 0.7448732906490182 Testing score : 0.7443680326819571 CV score : 0.7421864599161211 MAE score : 9.585658341817027 RMSE score : 13.437036132570164 Wall time: 1min 3s
'''
Class distrubution in data (before applying ML technique)
'''
Class=3, n=16211 (40.773%) Class=2, n=11442 (28.778%) Class=5, n=2791 (7.020%) Class=6, n=3765 (9.470%) Class=0, n=841 (2.115%) Class=4, n=403 (1.014%) Class=7, n=2942 (7.400%) Class=1, n=1364 (3.431%)
'''
Class distrubution in data (after applying ML technique)
'''
Class=3, n=16211 (12.516%) Class=2, n=14852 (11.467%) Class=5, n=16198 (12.506%) Class=6, n=17392 (13.428%) Class=0, n=16029 (12.376%) Class=4, n=16229 (12.530%) Class=7, n=16408 (12.669%) Class=1, n=16199 (12.507%)
'''
Machine Learning Model 1
'''
Training score : 0.8644159717386192 Testing score : 0.8407441168496488 CV score : 0.8405205874958405 MAE score : 6.957623607143605 RMSE score : 10.56443888401877 Wall time: 41.4 s
'''
Machine Learning Model 2
'''
Training score : 0.9797763066898475 Testing score : 0.8961332727735315 CV score : 0.8983456635734269 MAE score : 4.260000135293809 RMSE score : 8.531726332714456 Wall time: 4min 14s
'''
Machine Learning Model 3
'''
Training score : 0.9035822618773521 Testing score : 0.8471004913514728 CV score : 0.8513955067761726 MAE score : 5.247309153163161 RMSE score : 10.35146308093822 Wall time: 20.2 s
'''
Machine Learning Model 4
'''
Training score : 0.7413250639910955 Testing score : 0.7405589649876201 CV score : 0.7391001267565829 MAE score : 9.614271697577953 RMSE score : 13.4839777357165 Wall time: 1min 4s
'''
Class distrubution in data (before applying ML technique)
'''
Class=3, n=16211 (40.773%) Class=6, n=3765 (9.470%) Class=4, n=403 (1.014%) Class=2, n=11442 (28.778%) Class=7, n=2942 (7.400%) Class=0, n=841 (2.115%) Class=5, n=2791 (7.020%) Class=1, n=1364 (3.431%)
'''
Class distrubution in data (after applying ML technique)
'''
Class=0, n=12720 (17.964%) Class=1, n=11036 (15.586%) Class=2, n=5490 (7.753%) Class=3, n=5115 (7.224%) Class=4, n=13839 (19.544%) Class=5, n=8119 (11.466%) Class=6, n=6863 (9.692%) Class=7, n=7627 (10.771%)
'''
Machine Learning Model 1
'''
Training score : 0.896896815813323 Testing score : 0.8573166433423051 CV score : 0.8609984949796285 MAE score : 6.602359213402049 RMSE score : 9.904559634728031 Wall time: 32.7 s
'''
Machine Learning Model 2
'''
Training score : 0.98878698931307 Testing score : 0.9279624696346337 CV score : 0.9330980893280749 MAE score : 3.2908613046581094 RMSE score : 7.037654143748272 Wall time: 3min 38s
'''
Machine Learning Model 3
'''
Training score : 0.9338166184607831 Testing score : 0.8882001284225913 CV score : 0.8982719259407673 MAE score : 4.072782084873158 RMSE score : 8.752981282748028 Wall time: 8.79 s
'''
Machine Learning Model 4
'''
Training score : 0.7384125594802402 Testing score : 0.7362586980118333 CV score : 0.7350259363501742 MAE score : 9.799839233587337 RMSE score : 13.443873593191544 Wall time: 36.4 s
'''
Machine Learning Model performanance analysis (balanced)
'''
RFR
| model | resample | trainingScore | testingScore | cvScore | maeRate | rmseRate | |
|---|---|---|---|---|---|---|---|
| 0 | RFR | SMOTEENN | 0.988494 | 0.930876 | 0.933146 | 3.247518 | 6.882581 |
| 1 | RFR | SMOTE | 0.979113 | 0.899607 | 0.900539 | 4.176187 | 8.420707 |
| 2 | RFR | ADASYN | 0.979776 | 0.896133 | 0.898346 | 4.260000 | 8.531726 |
| 3 | KNN | SMOTEENN | 0.933817 | 0.888200 | 0.898272 | 4.072782 | 8.752981 |
| 4 | XGB | SMOTEENN | 0.898001 | 0.862799 | 0.861766 | 6.473563 | 9.696487 |
| 5 | KNN | SMOTE | 0.904073 | 0.851041 | 0.854066 | 5.123515 | 10.257216 |
| 6 | KNN | ADASYN | 0.903582 | 0.847100 | 0.851396 | 5.247309 | 10.351463 |
| 7 | XGB | SMOTE | 0.868456 | 0.844377 | 0.842200 | 6.870559 | 10.484148 |
| 8 | XGB | ADASYN | 0.864416 | 0.840744 | 0.840521 | 6.957624 | 10.564439 |
| 9 | GBR | SMOTE | 0.744873 | 0.744368 | 0.742186 | 9.585658 | 13.437036 |
| 10 | GBR | ADASYN | 0.741325 | 0.740559 | 0.739100 | 9.614272 | 13.483978 |
| 11 | GBR | SMOTEENN | 0.738413 | 0.736259 | 0.735026 | 9.799839 | 13.443874 |
Observation:
We are comparing the model performaces, so that best model we can choose for next evaluation stages (tunning / predictions).
After data balacing mechanisms, we have seen the improvemtents in model performances as compared to the without effect of the balacing. But in this techniques there are the chances of data points duplications and removals, as for balacing the data, different AI algorithm techniques uses its own procedures (Ex. targeting large values, distance based similarities, etc...)
Based on trials (previous results), below is the results for given data.
Sr. No Algorithm Is balanced Model accuracy Error rate
-------------------------------------------------------------------------------
01 XGB No 84 10
02 RFR No 85 10
03 XGB Yes 87 8
04 RFR Yes 92 6
Let's analyze how is the model perform in combination of event and holiday in past data (testing dataset). So that we can know the effect of those combinations is work or not.
Note: Every time we need to pass new data index and new value manually which is present in testing data. (because we previously used shuffle method.)
Holiday consideration (previous obs)
If alone there is event, price = 73.95 .....(impacted with event only +2)
[71.8224] ........dataframe -> holiday (1) + event (0) [71.8224] ........holiday (1) + event (0) [73.9591] ........holiday (0) + event (1) [73.9961] ........holiday (0) + event (0) [71.7774] ........holiday (1) + event (1)
Event consideration (previous obs)
If alone there is holiday, price = 71.61 .....(not much impacted with holiday only +1)
[71.8224] ........dataframe -> holiday (1) + event (0) [71.8224] ........holiday (1) + event (0) [73.9591] ........holiday (0) + event (1) [73.9961] ........holiday (0) + event (0) [71.7774] ........holiday (1) + event (1)
Business days consideration (previous obs)
If there is event and holiday, price = 92.58 ...... (impacted with event and holiday +9)
[71.8224] ........dataframe -> holiday (1) + event (0) [71.8224] ........holiday (1) + event (0) [73.9591] ........holiday (0) + event (1) [73.9961] ........holiday (0) + event (0) [71.7774] ........holiday (1) + event (1)
Holiday and event consideration (previous obs)
If no holiday and no event, price = 73.13 .....(impacted with both +1, but effect is smaller than alone event)
[71.8224] ........dataframe -> holiday (1) + event (0) [71.8224] ........holiday (1) + event (0) [73.9591] ........holiday (0) + event (1) [73.9961] ........holiday (0) + event (0) [71.7774] ........holiday (1) + event (1)
Observation: On multiple time simulation in historical test data, below are the observation,
This section proveides the information about which are the best parameters for given data and algorithm, so that we can improve the results from previous abse model. In base model, model parameter are consist of default values and data is fitted using default param. But in hyperparameter tunning, the searching algorithm finds the best situalble paramts rather than default one. So that using these best param, we can improve the results of model.
Hyperparameter tuning is an essential part of controlling the behavior of a machine learning model. If we don’t correctly tune our hyperparameters, our estimated model parameters produce suboptimal results, as they don’t minimize the loss function. This means our model makes more errors.
Methods of tunning:
1) Grid search
2) Random Search
3) Bayesian optimizations
ref: https://www.anyscale.com/blog/what-is-hyperparameter-tuning
Note: Commenting this while section as it takes all available machine CPU cores (all physical cores), avaliable memory (above 98% RAM) and lots of time (multiple hours depending upon the parameter define, its all combinations and total passed data size)
'''
Take lot of time to as per resource available
'''
This section provides the simulation based on user inputs. User need to submit inputs and based on inputs, model is trying to predict the Room Rate.
'''
User inputs on date - 03 May 2022
'''
User input accepted [Arrival date] User input accepted [depart date] User input accepted [Room type] User input accepted [Rate code] User input accepted [Track Code] ------------------------------------------------------------ Current date (today) : 2022-05-03 00:00:00 Check-in date : 2022-05-07 00:00:00 Check-out date : 2022-05-08 00:00:00 ------------------------------------------------------------ Expected check-in days : 4 Number of nights (stay) : 1 Occupancy rate (today) : 52 Occupancy rate (arrival) : 73 ADR rate (today) : 81.74 ADR rate (arrival) : 132.13 ------------------------------------------------------------ Room type : NQQ Rate code : BAR Track code : LEISURE
'''
Data of current date where user is booking the room (current day data)
'''
current year : 2022 current month : 5 current day : 3 current day week : 1 current day name : Tuesday current month name : May current weekend status : 0 current holiday status : 0 current holiday name : NA current event status : 0 current event name : NA
'''
Below is the predictions for next days
'''
(94, 17)
| Date | Month | Day | Nights | RoomType | RateCode | TrackCode | Holiday | Event | OccupancyLevel | PredictedPrice | YieldPrice | ADR | YieldRatio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-05-03 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 52 | 102.77 | 95.57 | 81.74 | -0.07 |
| 1 | 2022-05-04 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 46 | 102.77 | 92.49 | 85.93 | -0.10 |
| 2 | 2022-05-05 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 56 | 102.77 | 95.57 | 89.55 | -0.07 |
| 3 | 2022-05-06 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 81 | 102.89 | 104.95 | 137.53 | 0.02 |
| 4 | 2022-05-07 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 73 | 102.71 | 109.90 | 132.13 | 0.07 |
| 5 | 2022-05-08 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 13 | 101.55 | 91.39 | 79.98 | -0.10 |
| 6 | 2022-05-09 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 10 | 101.55 | 89.37 | 78.45 | -0.12 |
| 7 | 2022-05-10 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 106.12 | 93.38 | 84.90 | -0.12 |
| 8 | 2022-05-11 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 3 | 106.12 | 93.38 | 106.26 | -0.12 |
| 9 | 2022-05-12 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 106.12 | 93.38 | 107.18 | -0.12 |
| 10 | 2022-05-13 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 11 | 101.55 | 91.39 | 160.30 | -0.10 |
| 11 | 2022-05-14 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 12 | 101.55 | 91.39 | 156.28 | -0.10 |
| 12 | 2022-05-15 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 106.09 | 93.36 | 105.71 | -0.12 |
| 13 | 2022-05-16 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 6 | 106.12 | 93.38 | 102.51 | -0.12 |
| 14 | 2022-05-17 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 7 | 106.27 | 93.52 | 104.74 | -0.12 |
| 15 | 2022-05-18 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 7 | 106.41 | 93.64 | 114.15 | -0.12 |
| 16 | 2022-05-19 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 13 | 102.71 | 92.44 | 102.97 | -0.10 |
| 17 | 2022-05-20 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 9 | 102.69 | 90.37 | 131.24 | -0.12 |
| 18 | 2022-05-21 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 6 | 107.19 | 94.33 | 150.29 | -0.12 |
| 19 | 2022-05-22 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 107.19 | 94.32 | 96.99 | -0.12 |
| 20 | 2022-05-23 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 107.21 | 94.34 | 74.53 | -0.12 |
| 21 | 2022-05-24 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 107.21 | 94.35 | 78.99 | -0.12 |
| 22 | 2022-05-25 | 5 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 107.21 | 94.35 | 85.28 | -0.12 |
| 23 | 2022-05-26 | 5 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 2 | 107.21 | 94.35 | 64.90 | -0.12 |
| 24 | 2022-05-27 | 5 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 107.21 | 94.35 | 116.73 | -0.12 |
| 25 | 2022-05-28 | 5 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 6 | 107.21 | 94.35 | 135.59 | -0.12 |
| 26 | 2022-05-29 | 5 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 107.21 | 94.35 | 104.31 | -0.12 |
| 27 | 2022-05-30 | 5 | Monday | 1 | NQQ | BAR | LEISURE | 1 | 0 | 4 | 107.21 | 94.35 | 83.47 | -0.12 |
| 28 | 2022-05-31 | 5 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 2 | 107.21 | 94.35 | 103.33 | -0.12 |
| 29 | 2022-06-01 | 6 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 2 | 106.08 | 93.35 | 82.28 | -0.12 |
| 30 | 2022-06-02 | 6 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 30 | 101.52 | 94.42 | 112.19 | -0.07 |
| 31 | 2022-06-03 | 6 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 39 | 100.78 | 95.74 | 150.65 | -0.05 |
| 32 | 2022-06-04 | 6 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 55 | 102.70 | 104.76 | 153.92 | 0.02 |
| 33 | 2022-06-05 | 6 | Sunday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 10 | 101.50 | 89.32 | 108.64 | -0.12 |
| 34 | 2022-06-06 | 6 | Monday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 106.08 | 93.35 | 112.46 | -0.12 |
| 35 | 2022-06-07 | 6 | Tuesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 5 | 106.08 | 93.35 | 115.44 | -0.12 |
| 36 | 2022-06-08 | 6 | Wednesday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 1 | 106.08 | 93.35 | 88.70 | -0.12 |
| 37 | 2022-06-09 | 6 | Thursday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 3 | 106.07 | 93.34 | 121.56 | -0.12 |
| 38 | 2022-06-10 | 6 | Friday | 1 | NQQ | BAR | LEISURE | 0 | 0 | 4 | 106.05 | 93.33 | 112.78 | -0.12 |
| 39 | 2022-06-11 | 6 | Saturday | 1 | NQQ | BAR | LEISURE | 0 | 1 | 2 | 106.05 | 93.33 | 147.88 | -0.12 |
In this section, predictions visualization over date factor we can analyze.
'''
Line chart for next days prediction with detail information
'''
Observation:
'''
Line chart for occupancy level with next days predictions
'''
Observation:
'''
Bar chart for monthly averaged prediction price and yield price
'''
| PredictedPrice | YieldPrice | |
|---|---|---|
| Months | ||
| May | 105.104483 | 94.367586 |
| Jun | 104.305667 | 93.252000 |
| Jul | 105.478387 | 93.376129 |
| Aug | 103.765000 | 93.085000 |
Observation:
'''
Bar chart for daily averaged prediction price and yield price
'''
| PredictedPrice | YieldPrice | |
|---|---|---|
| Day | ||
| Monday | 105.143846 | 92.839231 |
| Tuesday | 105.370000 | 93.383571 |
| Wednesday | 104.647857 | 92.817143 |
| Thursday | 104.378571 | 93.382857 |
| Friday | 104.440769 | 94.262308 |
| Saturday | 104.931538 | 95.652308 |
| Sunday | 105.525385 | 93.173077 |
Observation:
'''
Bar chart for Event averaged prediction price and yield price
'''
| PredictedPrice | YieldPrice | |
|---|---|---|
| Event | ||
| 0 | 104.925 | 93.3190 |
| 1 | 104.710 | 100.6275 |
Observation:
'''
Bar chart for holiday averaged prediction price and yield price
'''
| PredictedPrice | YieldPrice | |
|---|---|---|
| Holiday | ||
| 0 | 104.877667 | 93.631556 |
| 1 | 105.775000 | 93.595000 |
Observation: